There are 10 questions in this activity. Save your answers in a Word document that you will hand in on Moodle as a .pdf. Keep your script file in your GitHub folder and make sure that all changes are pushed to GitHub. Code for all questions should be clearly commented with the question number. You will include a link to your script file as part of the final question in this activity. The assignment is worth 30 points.
Harmful Algal Blooms (HABs) occur when conditions in a body of water allow algae to grow rapidly. Some types of algae release toxins into the water as they grow, resulting in HABs. HABs contaminate freshwater resources, posing a public health concern. Swimming in a HAB may sicken people, and consuming the water can sicken or kill animals and people. Algal blooms can also create dead zones in water bodies: when the abundant algae die off and decompose, the oxygen content of the water is lowered. The presence of HABs can be observed as a discoloration of the water that may be blue-green, red, or brown, and a scummy film on the water. Not all algal blooms are harmful, but many health departments regularly test water to check for HABs, especially when a bloom is present.
The discoloration of bodies of water from algal blooms can often be observed from satellites when the blooms are large enough. Below is a satellite image of algae on the western side of Lake Erie:
HABs often arise when conditions allow for higher levels of sunlight and slow moving water, and when there are large inputs of nutrients like nitrogen and phosphorus. Residential and agricultural areas often spread high amounts of fertilizer (containing nitrogen and phosphorus) to promote plant growth. These nutrients can be carried into bodies of water with runoff from rainfall. HABs are a particular concern in many lakes in upstate New York, where beaches can close and warnings are issued to the public. In 2019, Verona Beach on Oneida Lake (northeast of Syracuse) was frequently closed due to HABs. We will use satellite imagery to map this algal bloom and take a closer look at the landcover (natural forest, agriculture, built environment) surrounding Oneida Lake to see if there are patterns associated with the algal bloom.
You will look at the Oneida Lake area using images collected by the European Space Agency Copernicus Sentinel-2 satellite. Full documentation can be found at https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi. Sentinel-2 collects data for 13 spectral bands at differing spatial resolutions: four bands at 10 meters, six bands at 20 meters, and three bands at 60 meters. You will work with all bands at 20 meters spatial resolution, which means some bands have been aggregated to a coarser spatial resolution and others have been downscaled to a finer spatial resolution. We will be working with the level 2A data, which has the highest level of quality control; its digital numbers represent bottom of atmosphere reflectance. These files are freely available online and typically cover a much larger geographic area. I have cropped the original files so that you can work with smaller files focused on the Oneida Lake area. You can find these files in the Sentinel folder within the data folder shared with you. We will be looking at images taken on August 14, 2019 for 9 bands. As part of the 2A data product, there is also a file that gives the probability that clouds are covering areas of the land surface.
Let’s load the packages that will be necessary for today’s analysis. Note that you will need to install mapview, caret, randomForest, and nnet packages.
library(raster)
library(sp)
library(rgdal)
library(rgeos)
library(mapview)
library(caret)
library(randomForest)
library(nnet)
We will also read in all 9 bands of data and the cloud file. I've also included data points that you will use later in the analysis for classifying landcover. I've set up a variable dirR that stores the path to the data folder so the file paths below are easier to manage.
#set up directory for oneida data folder
dirR <- "c:\\Users\\hkropp\\Google Drive\\courses\\Spring2020\\Environmental Data Science\\data\\oneida"
#read in Sentinel data
rdatB2 <- raster(paste0(dirR,"\\sentinel\\T18TVN_20190814T154911_B02_20m.tif"))
rdatB3 <- raster(paste0(dirR,"\\sentinel\\T18TVN_20190814T154911_B03_20m.tif"))
rdatB4 <- raster(paste0(dirR,"\\sentinel\\T18TVN_20190814T154911_B04_20m.tif"))
rdatB5 <- raster(paste0(dirR,"\\sentinel\\T18TVN_20190814T154911_B05_20m.tif"))
rdatB6 <- raster(paste0(dirR,"\\sentinel\\T18TVN_20190814T154911_B06_20m.tif"))
rdatB7 <- raster(paste0(dirR,"\\sentinel\\T18TVN_20190814T154911_B07_20m.tif"))
rdatB8 <- raster(paste0(dirR,"\\sentinel\\T18TVN_20190814T154911_B08_20m.tif"))
rdatB11 <- raster(paste0(dirR,"\\sentinel\\T18TVN_20190814T154911_B11_20m.tif"))
rdatB12 <- raster(paste0(dirR,"\\sentinel\\T18TVN_20190814T154911_B12_20m.tif"))
clouds <- raster(paste0(dirR,"\\sentinel\\MSK_CLDPRB_20m.tif"))
#read in validation data
#here verbose=FALSE hides the progress output
algae <- readOGR(paste0(dirR,"\\Oneida\\algae.shp"), verbose=FALSE)
agri <- readOGR(paste0(dirR,"\\Oneida\\agriculture.shp"), verbose=FALSE)
built <- readOGR(paste0(dirR,"\\Oneida\\built.shp"), verbose=FALSE)
forest <- readOGR(paste0(dirR,"\\Oneida\\forest.shp"), verbose=FALSE)
water <- readOGR(paste0(dirR,"\\Oneida\\water.shp"), verbose=FALSE)
wetlands <- readOGR(paste0(dirR,"\\Oneida\\wetlands.shp"), verbose=FALSE)
Let’s take a closer look at the data. First, it will be helpful to make a stack of all of the rasters. Make a stack of red, green, and blue bands (B4,B3,B2) for plotting and all bands so we can work with them easily.
#stack red green blue
rgbS <- stack(rdatB4,rdatB3,rdatB2)
#stack all raster data
allbands <- stack(rdatB2,rdatB3,rdatB4,rdatB5,rdatB6,rdatB7, rdatB8,rdatB11, rdatB12,clouds)
Viewing a true color image will help us visualize the area. Note we have to set a scale for plotRGB, which expects digital numbers to vary between 0 and 255 by default. Sentinel digital numbers are equal to reflectance*10,000. Reflectance can occasionally exceed a value of one because other sources of reflectance or interference can affect the satellite measurements. Let's take a look:
#view raster; maximum digital number is around 20,000 so set scale to that
plotRGB(rgbS, scale=20000)
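If you ever need actual reflectance values rather than digital numbers, dividing by 10,000 recovers them. A minimal sketch (reflB4 is a name introduced here, not part of the assignment):

```r
#Sentinel-2 L2A digital number = reflectance * 10,000,
#so dividing by 10,000 gives bottom of atmosphere reflectance
reflB4 <- rdatB4/10000
#most land surface values fall between 0 and 1;
#clouds and interference can push values higher
cellStats(reflB4, max)
```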
The image looks very dark and dull, making it difficult to distinguish objects in it. If we add a linear contrast stretch, we'll get an image with much more variation in color, but it also looks oversaturated. The clouds to the north of the lake reflect a lot of light, and that likely affects how the color over the land surface is displayed.
#view raster with a linear contrast stretch; keep the same scale
plotRGB(rgbS, scale=20000, stretch="lin")
The mapview package offers many more contrast stretch options, allowing us to specify the quantiles of digital numbers to plot. It can also plot many types of data and include a range of basemaps, including street maps and satellite imagery. However, this package is not always well maintained and has issues on some operating systems. You can see the output here in the tutorial, but if you cannot get the package to run, do not worry about it. The package is also nice because you can zoom in and out and select multiple layers. If you set the argument viewer.suppress=TRUE, an HTML document will be generated that you can save, open anytime, and interact with. This is a nice package to consider if you want to share maps that allow zooming or want a nice street map base layer.
#use mapview package
#view rgb and set up a contrast stretch, exclude clouds with high values
viewRGB(rgbS, r=1, g=2, b=3, maxpixels = 2297430, #view all pixels, don't lower resolution
        quantiles = c(0.00,0.995), #quantiles for stretch; cuts off high reflectance from clouds
        homebutton=FALSE,
        viewer.suppress=FALSE) #view in RStudio
In this image, you can readily make out the built environment with roads and buildings, especially in the North Syracuse area. In the lake you can see swirls of bright green, and a large part of the water in the eastern portion of the lake is bright green. The western portion of the lake clearly has water without high algae concentrations and shows up as only a dark blue color. Now let's look at the points in the shapefiles you read in earlier. These points were randomly selected and have been classified by the landcover they fall under. Here algae marks areas of the lake with algae, agri is agriculture, built is any built environment (roads or buildings), forest is tree-covered areas, water is open water with no algae, and wetlands is marsh or wetland regions.
#use mapview package
#view rgb and set up a contrast stretch, exclude clouds with high values
#and view all landcover types
viewRGB(rgbS, r=1, g=2, b=3, maxpixels = 2297430,
        quantiles = c(0.00,0.995), homebutton=FALSE,
        viewer.suppress=FALSE)+
  mapview(algae, color="grey25", col.regions="palegreen")+
  mapview(agri, color="grey25", col.regions="violet")+
  mapview(built, color="grey25", col.regions="darkgoldenrod3")+
  mapview(forest, color="grey25", col.regions="tan4")+
  mapview(water, color="grey25", col.regions="royalblue")+
  mapview(wetlands, color="grey25", col.regions="orangered2")
You will use these points to help classify the different types of landcover, including the water with the algal blooms. First, let's exclude the cloud regions. If you plot the cloud mask, you will see that the cloud layer varies from 0 to 100 and represents the percent probability that an area is covered by cloud. Some of our clouds are captured by values above 60, but lower percentages also capture areas that are not clouds, like the built environment.
plot(allbands[[10]])
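Before settling on a threshold, it can also help to look at how the cloud probabilities are distributed across the image. This quick check is not part of the original workflow, just one way to see why a cutoff around 60% is reasonable:

```r
#histogram of cloud probability values across the image
hist(getValues(allbands[[10]]),
     breaks=seq(0,100,by=5),
     main="Cloud probability", xlab="% probability")
```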
We will want to filter these clouds out of the data so we don't use them in the classification. Let's set all raster values to NA wherever the cloud probability is above 60%; otherwise the value stays the same.
#set clouds to NA
allbandsCloud <- list()
for(i in 1:9){
allbandsCloud[[i]] <- setValues(allbands[[i]],
ifelse(getValues(allbands[[10]])>60,NA,getValues(allbands[[i]])))
}
#stack accepts a list of rasters directly
allbandsCloudf <- stack(allbandsCloud)
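As an aside, the same filtering can be done more concisely with raster::mask, which applies the NA cells of one layer to another. This sketch assumes the same 60% threshold; cloudNA and allbandsCloud2 are names introduced here:

```r
#build a mask layer: NA where cloud probability > 60, 1 otherwise
cloudNA <- calc(allbands[[10]], function(x){ifelse(x > 60, NA, 1)})
#apply the mask to all 9 bands in one call
allbandsCloud2 <- mask(allbands[[1:9]], cloudNA)
```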
You can now see what all of the bands look like by plotting all of the rasters at once. You will see the clouds are greyed out as well. The clouds are not completely removed, but if we lower our threshold, we will start to remove non-cloud areas of the image. That is acceptable for this analysis, but a more formal analysis would want to address this issue.
#view all layers
plot(allbandsCloudf)
We can also make a false color image with the near infrared, red, and green bands. Removing the clouds also brought the maximum digital number down to around 10,000. In the false color image, you can see more differences in the vegetation. The algae also no longer appears green. Keep in mind that R is an excellent analysis tool, but it doesn't have a lot of options for viewing color images. People who focus on making satellite imagery visually appealing, rather than analyzing it, often work with ArcGIS or Photoshop.
plotRGB(allbandsCloudf,r=4, g=3, b=2,
scale=10000,
stretch="lin",
margins=TRUE,
colNA="grey50")
In remote sensing, it is often helpful to classify objects in our image. Landcover describes the biological and physical land surface. Once the land area is classified, you can calculate area, zonal statistics, or use it in part of an analysis. However, classifying the type of landcover for every single pixel (all 2,297,430 pixels) would be time consuming. Here we will classify the land cover using the points that indicate different landcover classes and machine learning approaches. We will look at algae covered water and open water in addition to the different types of terrestrial land surfaces. There are many different approaches to conducting classifications. We will use two common types of machine learning approaches used for remote sensing: Neural Networks and Random Forest. We’ll compare how these methods make predictions and learn about the packages that run them.
You will use part of the points with identified landcover as training data and the rest to validate predictions of landcover. For the models, we need these data in a dataframe with the remote sensing observations at each point and the xy coordinates for reference. First let's designate which points will be used for training versus validation. There are 120 points in each shapefile, and you will randomly select half for training and half for validation. Anytime you are working with data that involve randomly generated numbers, it is good to set a random seed. This allows for reproducibility in your script: it will generate the same random numbers every time you run it. You can set your seed to any number you like. Note that this only works if you run your script in the same order every time and don't rerun a random number function over and over in your console within the same session. If you want to ensure that a given vector of random numbers is exactly the same every time, you can set the seed before each individual draw. Here, we'll set the seed once at the beginning; as long as you run these draws in exactly this order, the samples will be the same every time.
#set seed so samples always the same
set.seed(12153)
The sample function randomly selects numbers from a discrete sequence of values, which makes it good for randomly selecting elements of a vector. We will make a vector that indicates which of our 120 points will be training versus validation data for each landcover type.
#randomly select
algSamp <- sort(sample(seq(1,120),60))
#set up vector for data type
algData <- rep("train",120)
#randomly replace half of the data to be validating data
algData[algSamp] <- "valid"
waterSamp <- sort(sample(seq(1,120),60))
#set up vector for data type
waterData <- rep("train",120)
#randomly replace half of the data to be validating data
waterData[waterSamp] <- "valid"
agriSamp <- sort(sample(seq(1,120),60))
#set up vector for data type
agriData <- rep("train",120)
#randomly replace half of the data to be validating data
agriData[agriSamp] <- "valid"
builtSamp <- sort(sample(seq(1,120),60))
#set up vector for data type
builtData <- rep("train",120)
#randomly replace half of the data to be validating data
builtData[builtSamp] <- "valid"
forestSamp <- sort(sample(seq(1,120),60))
#set up vector for data type
forestData <- rep("train",120)
#randomly replace half of the data to be validating data
forestData[forestSamp] <- "valid"
wetlandsSamp <- sort(sample(seq(1,120),60))
#set up vector for data type
wetlandsData <- rep("train",120)
#randomly replace half of the data to be validating data
wetlandsData[wetlandsSamp] <- "valid"
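The six repeated blocks above could also be condensed with a small helper function. A sketch (splitTV is a name introduced here); note that each call draws from the same random stream, so you would still need to call it in a consistent order after set.seed to reproduce the same samples:

```r
#label nValid of n points as validation, the rest as training
splitTV <- function(n=120, nValid=60){
  dataType <- rep("train", n)
  dataType[sort(sample(seq(1,n), nValid))] <- "valid"
  dataType
}
#e.g. algData <- splitTV(); waterData <- splitTV(); ...
```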
Next you will set up the dataframe that includes all of the raster data for each landclass type. We will need the coordinates of the points, the label for whether a point will be used in training or validation, and an identifier of the landcover type. It would be helpful to have a table that keeps track of the landcover ID (landcID). I often prefer this method of keeping track of variable names rather than working with factors.
#create id table that gives each landcover an ID
landclass <- data.frame(landcID= seq(1,6),
landcover = c("algal bloom", "open water","agriculture","built","forest","wetlands"))
#set up table with coordinates and data type (validate or train) for each point
landExtract <- data.frame(landcID = rep(seq(1,6),each=120),
sampleType=c(algData,waterData,agriData,builtData,forestData, wetlandsData),
x=c(algae@coords[,1],water@coords[,1],agri@coords[,1],built@coords[,1],forest@coords[,1],wetlands@coords[,1] ),
y=c(algae@coords[,2],water@coords[,2],agri@coords[,2],built@coords[,2],forest@coords[,2],wetlands@coords[,2] ))
Next you will need to extract the raster values for each point. The extract function pulls values from the raster stack at each point's coordinates into a dataframe.
#extract raster data at each point
#using point coordinates
rasterEx <- data.frame(extract(allbandsCloudf,landExtract[,3:4]))
#give names of bands
colnames(rasterEx) <- c("B2","B3","B4","B5","B6","B7","B8","B11","B12")
#combine point information with raster information
dataAll <- cbind(landExtract,rasterEx)
#preview
head(dataAll)
## landcID sampleType x y B2 B3 B4 B5 B6 B7 B8 B11 B12
## 1 1 valid 435221.7 4781719 325 531 283 436 264 243 185 26 15
## 2 1 valid 435318.6 4782049 319 544 288 437 285 259 189 26 20
## 3 1 valid 434626.4 4782052 298 454 245 352 221 209 151 24 25
## 4 1 valid 434615.3 4781600 280 432 231 306 195 197 136 29 14
## 5 1 valid 435432.9 4781163 321 536 286 420 259 255 174 22 21
## 6 1 train 435264.6 4781509 330 561 297 498 302 301 217 28 27
The final step in preparing the data is to separate the training and validation data.
trainD <- dataAll[dataAll$sampleType == "train",]
validD <- dataAll[dataAll$sampleType == "valid",]
You will start with classification using the random forest method. A brief overview is posted in the data folder (random forest.pdf). This method can be used for classification or continuous predictions. It works well with data with a large number of features, generally performs well, and can handle noisy data. However, tuning the model to the data is not always straightforward. This is a commonly used prediction method for remotely sensed data.
For both methods today (random forest and neural networks), we will use the caret package to help tune the parameters. The data folder also contains an overview of caret and parameter tuning in the caret.pdf file. For many machine learning methods, there are parameters that must be set to configure the model, and you can often choose from a range of values depending on the method you are using. Finding the parameter values that work best for predictions can involve a lot of trial and error and setting arbitrary values. The caret package helps streamline this part of the model training process by finding the optimal parameter settings within a range of values. We will use caret to find the optimal parameter setting for our random forest predictions.
#k-fold cross-validation
tc <- trainControl(method = "repeatedcv", # repeated cross-validation of the training data
                   number = 10, # 10 folds
                   repeats = 10) # number of repeats
###random forests
#Typically square root of number of variables
rf.grid <- expand.grid(mtry=1:sqrt(9)) # number of variables available for splitting at each tree node
Here trainControl indicates how we want to set up our parameter tuning; we are using a repeated cross-validation method. We also need to indicate the values to check for our random forest. There is one parameter that can be tuned in caret for random forests: mtry, the number of variables sampled as candidates when splitting the decision trees that make up the forest. The default value for classification is typically the square root of the number of variables in the dataset (here we have 9 bands of the electromagnetic spectrum, so 9 variables). However, you could also choose a smaller value (1, 2, or 3), so we will tune the model over this grid of options. Now that we've set up the tuning, we will train the model with our training dataset through the caret package.
# Train the random forest model to the Sentinel-2 data
#note that caret:: will make sure we use train from the caret package
rf_model <- caret::train(x = trainD[,c(5:13)], #digital number data
                         y = as.factor(trainD$landcID), #land class we want to predict
                         method = "rf", #use random forest
                         metric = "Accuracy", #assess by accuracy
                         trControl = tc, #use our cross-validation settings
                         tuneGrid = rf.grid) #parameter tuning grid
#check output
rf_model
## Random Forest
##
## 360 samples
## 9 predictor
## 6 classes: '1', '2', '3', '4', '5', '6'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 360, 360, 360, 360, 360, 360, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.9429708 0.9312731
## 2 0.9404174 0.9281964
## 3 0.9388809 0.9263428
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
This process might take a minute to run on your computer. Once it finishes, you can see the model checked accuracy and Kappa. Kappa is similar to accuracy but accounts for the level of accuracy that we might see just by random chance. Now that you have trained the model, you can make predictions for the entire raster of observations. Note that the raster names have to match the variable names in the training dataset, so you'll have to change them first.
# Change name in raster stack to match training data
names(allbandsCloudf) <- c("B2","B3","B4","B5","B6","B7","B8","B11","B12")
# Apply the random forest model to the Sentinel-2 data
rf_prediction <- raster::predict(allbandsCloudf, model=rf_model)
#view predictions
plot(rf_prediction)
#landcover class names
landclass
## landcID landcover
## 1 1 algal bloom
## 2 2 open water
## 3 3 agriculture
## 4 4 built
## 5 5 forest
## 6 6 wetlands
This prediction looks pretty good initially. You can see most of the algae is predicted in the eastern portion of the lake. It is hard to dig into the predictions with this continuous raster legend. Let’s fix that on the map:
#set up categorical colors
landclass$cols <-c("#a6d854","#8da0cb","#66c2a5",
"#fc8d62","#ffffb3","#ffd92f")
#make plot and hide legend
plot(rf_prediction,
breaks=seq(0,6),
col=landclass$cols ,
legend=FALSE, axes=FALSE)
legend("bottomleft", paste(landclass$landcover),
fill=landclass$cols ,bty="n")
Let’s evaluate the predictions more formally using the validation dataset. First you have to extract the classifications from the raster of predictions for the cells where there is validation data points.
#get validation data from raster by extracting
#cell values at the cell coordinates
rf_Eval <- extract(rf_prediction, validD[,3:4])
You are ready to compare the predictions to the observed data. You will set up a confusion matrix to compare the observations to the predictions.
#make the confusion matrix
rf_errorM <- confusionMatrix(as.factor(rf_Eval),as.factor(validD$landcID))
#add landcover names
colnames(rf_errorM$table) <- landclass$landcover
rownames(rf_errorM$table) <- landclass$landcover
#view the matrix
rf_errorM$table
## Reference
## Prediction algal bloom open water agriculture built forest wetlands
## algal bloom 60 0 0 1 0 0
## open water 0 60 0 0 0 0
## agriculture 0 0 50 3 0 0
## built 0 0 5 51 0 0
## forest 0 0 0 0 58 1
## wetlands 0 0 5 0 2 59
#look at the overall accuracy
rf_errorM$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 9.521127e-01 9.425265e-01 9.244332e-01 9.718598e-01 1.690141e-01
## AccuracyPValue McnemarPValue
## 2.038345e-234 NaN
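Kappa can also be checked by hand from the confusion matrix: it rescales observed accuracy by the agreement expected from chance alone. A sketch using the table above (tab, po, and pe are names introduced here):

```r
#observed accuracy: proportion of points on the diagonal
tab <- rf_errorM$table
po <- sum(diag(tab))/sum(tab)
#chance agreement, based on the row and column totals
pe <- sum(rowSums(tab)*colSums(tab))/(sum(tab)^2)
#Cohen's kappa; should match the Kappa value reported above
(po - pe)/(1 - pe)
```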
Now you will compare your random forest predictions to another common approach, neural networks. Neural networks can be used for classification or numerical predictions. They are often successful at classifications without needing many assumptions about the data and are capable of detecting complex patterns. However, this method can be computationally intensive and prone to overfitting. First you'll set up the parameter tuning in caret. You will fit the model with the nnet function. This requires a size parameter: the number of nodes in the hidden layer of your network topology. You will also need to specify the decay parameter, which is used to prevent overfitting during model training.
#set up grid
nnet.grid <- expand.grid(size = seq(from = 16, to = 28, by = 2), # number of neurons in the hidden layer
                         decay = seq(from = 0.1, to = 0.6, by = 0.1)) # regularization parameter to avoid over-fitting
Now you are ready to train the model:
nnet_model <- caret::train(x = trainD[,c(5:13)], y = as.factor(trainD$landcID),
                           method = "nnet", metric = "Accuracy", trControl = tc,
                           tuneGrid = nnet.grid, trace = FALSE)
If trace=TRUE is specified, you will see a lot of printout as the parameter tuning proceeds through the grid. That output can help you confirm that you are running enough iterations and provide you with more information; here we set trace=FALSE to suppress it. Now let's view the training summary:
nnet_model
## Neural Network
##
## 360 samples
## 9 predictor
## 6 classes: '1', '2', '3', '4', '5', '6'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 360, 360, 360, 360, 360, 360, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 16 0.1 0.7176569 0.6605000
## 16 0.2 0.7149216 0.6596660
## 16 0.3 0.7553332 0.7058598
## 16 0.4 0.7828334 0.7401202
## 16 0.5 0.7762040 0.7314266
## 16 0.6 0.7546306 0.7059679
## 18 0.1 0.7554341 0.7074364
## 18 0.2 0.7624748 0.7163763
## 18 0.3 0.7864773 0.7437067
## 18 0.4 0.7802358 0.7363416
## 18 0.5 0.7734763 0.7296485
## 18 0.6 0.7895332 0.7472819
## 20 0.1 0.7601308 0.7120266
## 20 0.2 0.7887542 0.7461851
## 20 0.3 0.7658280 0.7194183
## 20 0.4 0.7960721 0.7549932
## 20 0.5 0.7772726 0.7323027
## 20 0.6 0.7737153 0.7292660
## 22 0.1 0.7587337 0.7107528
## 22 0.2 0.7583146 0.7106100
## 22 0.3 0.7702456 0.7240428
## 22 0.4 0.8095389 0.7709303
## 22 0.5 0.8042956 0.7647417
## 22 0.6 0.7625634 0.7155960
## 24 0.1 0.7638491 0.7170361
## 24 0.2 0.7776230 0.7331274
## 24 0.3 0.7774022 0.7336149
## 24 0.4 0.8022955 0.7624016
## 24 0.5 0.8112671 0.7734074
## 24 0.6 0.8043233 0.7647822
## 26 0.1 0.7842818 0.7410693
## 26 0.2 0.7761959 0.7314410
## 26 0.3 0.7958435 0.7550253
## 26 0.4 0.8016219 0.7614815
## 26 0.5 0.7908097 0.7488094
## 26 0.6 0.7973048 0.7569425
## 28 0.1 0.7629662 0.7160580
## 28 0.2 0.7866473 0.7440182
## 28 0.3 0.8342754 0.8008739
## 28 0.4 0.8092477 0.7707988
## 28 0.5 0.8186254 0.7823385
## 28 0.6 0.8101089 0.7727041
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 28 and decay = 0.3.
Make predictions for the entire image:
# Apply the neural network model to the Sentinel-2 data
nnet_prediction <- raster::predict(allbandsCloudf, model=nnet_model)
#make plot and hide legend
plot(nnet_prediction,
breaks=seq(0,6),
col=landclass$cols ,
legend=FALSE)
legend("bottomleft", paste(landclass$landcover),
fill=landclass$cols ,bty="n")
Now you can check the model predictions:
#extract predictions
nn_Eval = extract(nnet_prediction, validD[,3:4])
#confusion matrix
nn_errorM = confusionMatrix(as.factor(nn_Eval),as.factor(validD$landcID))
colnames(nn_errorM$table) <- landclass$landcover
rownames(nn_errorM$table) <- landclass$landcover
nn_errorM$table
## Reference
## Prediction algal bloom open water agriculture built forest wetlands
## algal bloom 60 0 0 0 0 0
## open water 0 60 0 0 0 0
## agriculture 0 0 35 6 5 4
## built 0 0 9 49 0 0
## forest 0 0 9 0 43 4
## wetlands 0 0 7 0 12 52
nn_errorM$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.422535e-01 8.106937e-01 8.000950e-01 8.785865e-01 1.690141e-01
## AccuracyPValue McnemarPValue
## 4.298002e-170 NaN
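Beyond overall accuracy, confusionMatrix also returns per-class metrics in its byClass element, which can pinpoint which classes drive the lower accuracy. For example, sensitivity here is the proportion of each reference class that was correctly predicted:

```r
#per-class metrics; sensitivity = proportion of each
#reference class correctly predicted
nn_errorM$byClass[,c("Sensitivity","Specificity")]
```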
You will notice that the neural network does not predict agriculture very well. This may be because some agricultural fields are bare and brown in color while others are green. Our data may not be able to differentiate our definition of agricultural fields very well. On rare occasions, the neural network may not predict any areas to be agricultural at all, and you may get a warning when your confusion matrix is generated. Even though you are tuning the parameters over a grid, you are looking at a narrow range to allow this to run faster. In preparing this exercise, I spent some time finding the range that worked best without having you run too many search options. When predictions are not very good for a class, tuning the grid of parameter values can help. Let's compare the maps of the two methods side by side.
par(mfrow=c(2,1), mai=c(0,0,0,0))
#random forest
plot(rf_prediction,
breaks=seq(0,6),
col=landclass$cols ,
legend=FALSE)
#legend
legend("bottomleft", paste(landclass$landcover),
fill=landclass$cols ,bty="n")
#add title
mtext("Random Forest", side=3,cex=2, line=-5)
#neural network
plot(nnet_prediction,
breaks=seq(0,6),
col=landclass$cols ,
legend=FALSE, axes=FALSE)
#add legend
legend("bottomleft", paste(landclass$landcover),
fill=landclass$cols, bty="n")
#add title
mtext("Neural network", side=3,cex=2, line=-5)
You set out to do this analysis to see if you could classify the area of water experiencing an algal bloom and examine the surrounding landcover. Now let's assess the conclusions and potential bias in interpreting these classifications. Let's start by comparing the area predicted to be an algal bloom. You can check the number of cells in each class using the freq function in the raster package.
#cell count neural net
freq(nnet_prediction)
## value count
## [1,] 1 310401
## [2,] 2 203077
## [3,] 3 467316
## [4,] 4 178778
## [5,] 5 888924
## [6,] 6 244535
## [7,] NA 4399
#cell count random forest
freq(rf_prediction)
## value count
## [1,] 1 243704
## [2,] 2 290326
## [3,] 3 410477
## [4,] 4 181132
## [5,] 5 872256
## [6,] 6 295136
## [7,] NA 4399
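Because each cell covers 20 m × 20 m (400 m²), these counts can be converted to area. A sketch for the random forest predictions (rfFreq is a name introduced here):

```r
#convert cell counts to area in square kilometers
rfFreq <- freq(rf_prediction)
#first 6 rows are the landcover classes; the last row is NA (clouds)
data.frame(landcover = landclass$landcover,
           area.km2 = (rfFreq[1:6,"count"]*400)/(10^6))
```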
Next, it would be helpful to compare differences in predictions between the methods and see the areas where the outcomes disagree most.
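One way to start (a sketch; disagree is a name introduced here) is to map the cells where the two predictions differ:

```r
#cells where the two classifications disagree (1) or agree (0)
disagree <- rf_prediction != nnet_prediction
plot(disagree, legend=FALSE, axes=FALSE,
     col=c("grey90","firebrick"))
```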